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Abstract 

Tor’s growing popularity and user diversity has re- 
sulted in network performance problems that are not 
well understood. A large body of work has attempted 
to solve these problems without a complete understand- 
ing of where congestion occurs in Tor. In this paper, 
we first study congestion in Tor at individual relays as 
well as along the entire end-to-end Tor path and find 
that congestion occurs almost exclusively in egress ker- 
nel socket buffers. We then analyze Tor’s socket interac- 
tions and discover two major issues affecting congestion: 
Tor writes sockets sequentially, and Tor writes as much 
as possible to each socket. We thus design, implement, 
and test KIST: a new socket management algorithm that 
uses real-time kernel information to dynamically com- 
pute the amount to write to each socket while consider- 
ing all writable circuits when scheduling new cells. We 
find that, in the medians, KIST reduces circuit conges- 
tion by over 30 percent, reduces network latency by 18 
percent, and increases network throughput by nearly 10 
percent. We analyze the security of KIST and find an ac- 
ceptable performance and security trade-off, as it does 
not significantly affect the outcome of well-known la- 
tency and throughput attacks. While our focus is Tor, 
our techniques and observations should help analyze and 
improve overlay and application performance, both for 
security applications and in general. 

1 Introduction 

Tor [21] is the most popular overlay network for com- 
municating anonymously online. Tor serves millions of 
users daily by transferring their traffic through a source- 
routed circuit of three volunteer relays, and encrypts the 
traffic in such a way that no one relay learns both its 
source and intended destination. Tor is also used to resist 
online censorship, and its support for hidden services, 
network bridges, and protocol obfuscation has helped at- 
tract a large and diverse set of users. 


While Tor’s growing popularity, variety of use cases, 
and diversity of users have provided a larger anonymity 
set, they have also led to performance issues [23]. For 
example, it has been shown that roughly half of Tor’s 
traffic can be attributed to BitTorrent [18,43], while the 
more recent use of Tor by a botnet [29] has further in- 
creased concern about Tor’s ability to utilize volunteer 
resources to handle a growing user base [20, 36,37,45], 

Numerous proposals have been made to battle Tor’s 
performance problems, some of which modify the 
mechanisms used for path selection [13,59,60], client 
throttling [14, 38, 45], circuit scheduling [57], and 
flow/congestion control [15], While some of this work 
has or will be incorporated into the Tor software, none of 
it has provided a comprehensive understanding of where 
the most significant source of congestion occurs in a 
complete Tor deployment. This lack of understanding 
has led to the design of uninformed algorithms and spec- 
ulative solutions. In this paper, we seek a more thor- 
ough understanding of congestion in Tor and its effect on 
Tor’s security. We explore an answer to the fundamental 
question — “ Where is Tor slow?” — and design informed 
solutions that not only decrease congestion, but also im- 
prove Tor’s ability to manage it as Tor continues to grow. 
Congestion in Tor: We use a multifaceted approach to 
exploring congestion. First, we develop a shared library 
and Tor software patch for measuring congestion local to 
relays running in the public Tor network, and use them 
to measure congestion from three live relays under our 
control. Second, we develop software patches for Tor 
and the open-source Shadow simulator [7], and use them 
to measure congestion along the full end-to-end path in 
the largest known, at-scale, private Shadow-Tor deploy- 
ment. Our Shadow patches ensure that our congestion 
measurements are accurate and realistic; we show how 
they significantly improve Shadow’s TCP implementa- 
tion, network topology, and Tor models. 1 

*We have contributed our patches to the Shadow project [7] and 
they have been integrated as of Shadow release 1.9.0. 
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To the best of our knowledge, we are the first to con- 
sider such a comprehensive range of congestion infor- 
mation that spans from individual application instances 
to full network sessions for the entire distributed system. 
Our analysis indicates that congestion occurs almost 
exclusively inside of the kernel egress socket buffers, 
dwarfing the Tor and the kernel ingress buffer times. This 
finding is consistent among all three public Tor relays we 
measured, and among relays in every circuit position in 
our private Shadow-Tor deployment. This result is sig- 
nificant, as Tor does not currently prevent, detect, or oth- 
erwise manage kernel congestion. 

Mismanaged Socket Output: Using this new under- 
standing of where congestion occurs, we analyze Tor’s 
socket output mechanisms and find two significant and 
fundamental design issues: Tor sequentially writes to 
sockets while ignoring the state of all sockets other than 
the one that is currently being written; and Tor writes as 
much as possible to each socket. 

By writing to sockets sequentially. Tor’s circuit sched- 
uler considers only a small subset of the circuits with 
writable data. We show how this leads to improper uti- 
lization of circuit priority mechanisms, which causes Tor 
to send lower priority data from one socket before higher 
priority data from another. This finding confirms evi- 
dence from previous work indicating the ineffectiveness 
of circuit priority algorithms [35], 

By writing as much as possible to each socket. Tor is 
often delivering to the kernel more data than it is capable 
of sending due to either physical bandwidth limitations 
or throttling by the TCP congestion control protocol. Not 
only does writing too much increase data queuing de- 
lays in the kernel, it also further reduces the effectiveness 
of Tor’s circuit priority mechanisms because Tor relin- 
quishes control over the priority of data after it is deliv- 
ered to the kernel. 2 This kernel overload is exacerbated 
by the fact that a Tor relay may have thousands of sock- 
ets open at any time in order to facilitate data transfer 
between other relays, a problem that may significantly 
worsen if Tor adopts recent proposals [16, 26] that sug- 
gest increasing the number of sockets between relays. 
KIST: Kernel-Informed Socket Transport: To solve 
the socket management problems outlined above, we de- 
sign KIST: a Kernel-Informed Socket Transport algo- 
rithm. KIST has two features that work together to sig- 
nificantly improve Tor’s control over network conges- 
tion. First, KIST changes Tor’s circuit level scheduler so 
that it chooses from all circuits with writable data rather 
than just those belonging to a single TCP socket. Second, 
to complement this global scheduling approach, KIST 
also dynamically manages the amount of data written to 
each socket based on real-time kernel and TCP state in- 

-To the best of our knowledge, the Linux kernel uses a variant of 
the first-come first-serve queuing discipline among sockets. 


formation. In this way, KIST attempts to minimize the 
amount of data that exists in the kernel that cannot be 
sent, and to maximize the amount of time that Tor has 
control over data priority. 

We perform in-depth experiments in our at-scale pri- 
vate Shadow-Tor network, and we show how KIST can 
be used to relocate congestion from the kernel into Tor, 
where it can be properly managed. We also show how 
KIST allows Tor to correctly utilize its circuit priority 
scheduler, reducing download latency by over 660 mil- 
liseconds, or 23.5 percent, for interactive traffic streams 
typically generated by web browsing behaviors. 

We analyze the security of KIST, showing how it af- 
fects well-known latency and throughput attacks. In par- 
ticular, we show the extent to which the latency improve- 
ments reduce the number of round-trip time measure- 
ments needed to conduct a successful latency attack [31], 
We also show how KIST does not significantly affect an 
adversary’s ability to collect accurate measurements re- 
quired for the throughput correlation attack [44] when 
compared to vanilla Tor. 

Outline of Major Contributions: We outline our major 
contributions as follows: 

- in Section 3 we discuss improvements to the open- 
source Shadow simulator that significantly enhance 
its accuracy, including experiments with the largest 
known private Tor network of 3,600 relays and 
13,800 clients running real Tor software; 

- in Section 4 we discuss a library we developed to 
measure congestion in Tor, and results from the first 
known end-to-end Tor circuit congestion analysis; 

- in Section 5 we show how Tor’s current management 
of sockets results in ineffective circuit priority, detail 
the KIST design and prototype, and show how it im- 
proves Tor’s ability to manage congestion through a 
comprehensive and full-network evaluation; and 

- in Section 6 we analyze Tor’s security with KIST by 
showing how our performance improvements affect 
well-known latency and throughput attacks. 

2 Background and Related Work 

Tor [21] is a volunteer-operated anonymity service 
used by an estimated hundreds of thousands of daily 
users [28], Tor assumes an adversary who can monitor 
a portion of the underlying Internet and/or operate Tor 
relays. People primarily use Tor to prevent an adversary 
from discovering the endpoints of their communications, 
or disrupting access to information. 

Tor Traffic Handling: Tor provides anonymity by form- 
ing source-routed paths called circuits that consist of 
(usually) three relays on an overlay network. Clients 
transfer TCP-based application traffic within these cir- 
cuits; encrypted application-layer headers and payloads 
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Figure 1: Internals of cell processing within in a Tor relay. 
Dashed lines denote TCP connections. Transitions between 
buffers — both within the kernel (side boxes) and within Tor 
(center box) — are shown with solid arrows. 

make it more difficult for an adversary to discern an inter- 
cepted communication’s endpoints or learn its plaintext. 

A given circuit may carry several Tor streams, which 
are logical connections between clients and destinations. 
For example, a HTTP request to usenix.org may result 
in several Tor streams (i.e., to fetch embedded objects); 
these streams may all be transported over a single circuit. 
Circuits are themselves multiplexed over TLS connec- 
tions between relays whenever their paths share an edge; 
that is, all concurrent circuits between relays u and v will 
be transferred over the same TLS connection between the 
two relays, irrespective of the circuits’ endpoints. 

The unit of transfer in Tor is a 512-byte cell. Figure 1 
depicts the internals of cell processing within a Tor re- 
lay. In this example, the relay maintains two TLS con- 
nections with other relays. Incoming packets from the 
two TCP streams are first demultiplexed and placed into 
kernel socket input buffers by the underlying OS (Fig- 
ure la) 3 . The OS processes the packets, usually in FIFO 
order, delivering them to Tor where they are reassem- 
bled into TLS-encrypted cells using dedicated Tor in- 
put buffers (Figure lb). Upon receipt of an entire TLS 
datagram, the TLS layer is removed, the cell is onion- 
crypted, 4 and then transferred and enqueued in the ap- 
propriate Tor circuit queue (Figure lc). Each relay main- 
tains a queue for each circuit that it is currently serving. 
Cells from the same Tor input buffer may be enqueued in 
different circuit queues, since a single TCP connection 
between two relays may carry multiple circuits. 

Tor uses a priority-based circuit scheduling approach 
that attempts to prioritize interactive web clients over 
bulk downloaders [57], The circuit scheduler selects a 
cell from a circuit queue to process based on this prioriti- 
zation, onion-crypts the cell, and stores it in a Tor output 
buffer (Figure Id). Once the Tor output buffer contains 
sufficient data to form a TLS packet, the data is written 
to the kernel for transport (Figure If). 

Improving Tor Performance: There is a large body 
of work that attempts to improve Tor’s network perfor- 

3 For simplicity, we consider only relays that run Linux since such 
relays represent 75% of all Tor relays and contribute 91% of the band- 
width of the live Tor network [58], 

4 Encrypted or decrypted, depending on circuit direction. 


mance, e.g., by refining Tor’s relay selection strategy 
[1 1,55,56] or providing incentives to users to operate Tor 
relays [36,37,45]. These approaches are orthogonal and 
can be applied in concert with our work, which focuses 
on improving Tor’s congestion management. 

Most closely related to this paper are approaches 
that modify Tor’s circuit scheduling, flow control, or 
transport mechanisms. Reardon and Goldberg suggest 
replacing Tor’s TCP-based transport mechanism with 
UDP-based DTLS [54], while Mathewson explores us- 
ing SCTP [40], Murdoch [47] explains that the UDP 
approach is promising, but there are challenges that 
have thus far prevented the approach from being de- 
ployed: there is limited kernel support for SCTP; and 
the lack of hop-by-hop reliability from UDP-based trans- 
ports causes increased load at Tor’s exit relays. Our work 
allows Tor to best utilize the existing TCP transport in the 
short term while work toward a long term UDP deploy- 
ment strategy continues. 

Tang and Goldberg propose the use of the exponential 
weighted moving average (EWMA) to characterize cir- 
cuits’ recent levels of activity, with bursty circuits given 
greater priority than busy circuits [57] (to favor interac- 
tive web users over bulk downloaders). Unfortunately, 
although Tor has adopted EWMA, the network has not 
significantly benefitted from its use [35]. In our study of 
where Tor is slow, we show that EWMA is made ineffec- 
tive by Tor’s current management of sockets, and can be 
made effective through our proposed modifications. 

AlSabah et al. propose an ATM-like congestion and 
flow control system for Tor called N23 [15], Their ap- 
proach causes pushback effects to previous nodes, reduc- 
ing congestion in the entire circuit. Our KIST strategy is 
complementary to N23, focusing instead on local tech- 
niques to remove kernel-level congestion at Tor relays. 

Torchestra [26] uses separate TCP connections to 
carry interactive and bulk traffic, isolating the effects of 
congestion between the two traffic classes. Conceptually, 
Torchestra moves circuit-selection logic to the kernel, 
where the OS schedules packets for the two connections. 
Relatedly, AlSabah and Goldberg introduce PCTCP [16], 
a transport mechanism for Tor in which each circuit is as- 
signed its own IPsec tunnel. In this paper, we argue that 
overloading the kernel with additional sockets reduces 
the effectiveness of circuit priority mechanisms since the 
kernel has no information regarding the priority of data. 
In contrast, we aim to move congestion management to 
Tor, where priority scheduling can be most effective. 

Nowlan et al. [50] propose the use of uTCP and 
uTLS [49] to tackle the “head-of-line” blocking problem 
in Tor. Here, they bypass TCP’s in-order delivery mech- 
anism to peek at traffic that has arrived but is not ready 
to be delivered by the TCP stack (e.g., because an earlier 
packet was dropped). Since Tor multiplexes multiple cir- 
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cuits over a single TCP connection, their technique offers 
significant latency improvements when connections are 
lossy, since already-arrived traffic can be immediately 
processed. Our technique can be viewed as a form of 
application-layer head-of-line countermeasure since we 
move scheduling decisions from the TCP stack to within 
Tor. In contrast to Nowlan et al.’s approach, we do not re- 
quire any kernel-level modifications or changes to Tor’s 
transport mechanism. 

3 Enhanced Network Experimentation 

To increase confidence in our experiments, we introduce 
three significant enhancements to the Shadow Tor simu- 
lator [35] and its existing models [33]: a more realistic 
simulated kernel and TCP network stack, an updated In- 
ternet topology model, and the largest known deployed 
private Tor network. The enhancements in this section 
represent a large and determined engineering effort; we 
will show how Tor experimental accuracy has signifi- 
cantly benefited as a result of these improvements. We 
remark that our improvements to Shadow will have an 
immediate impact beyond this work to the various re- 
search groups around the world that use the simulator. 
Shadow TCP Enhancements: After reviewing 

Shadow [7], we first discovered that it was missing 
many important TCP features, causing it to be less 
accurate than desired. We enhanced Shadow by adding 
the following: retransmission timers [52], fast retrans- 
mit/recovery [12], selective acknowledgments [42], and 
forward acknowledgments [41], Second, we discovered 
that Shadow was using a very primitive version of the 
basic additive-increase multiplicative -decrease (AIMD) 
congestion control algorithm. We implemented a much 
more complete version of the CUBIC algorithm [27], the 
default congestion control algorithm used in the Linux 
kernel since version 2.6.19. CUBIC is an important algo- 
rithm for properly adjusting the congestion window. We 
will show how our implementation of these algorithms 
greatly enhance Shadow’s accuracy, which is paramount 
to the remainder of this paper. See Appendix A.l [34] 
for more details about our modifications. 

We verify the accuracy of Shadow’s new TCP imple- 
mentation to ensure that it is adequately handling packet 
loss and properly growing the congestion window by 
comparing its behavior to ns [5 1], a popular network sim- 
ulator, because of the ease at which ns is able to model 
packet loss rates. In our first experiment, both Shadow 
and ns have two nodes connected by a 10 MiB/s link 
with a 10 ms round trip time. One node then down- 
loads a 100 MiB hie 10 times for each tested packet loss 
rate. Figure 2a shows that the average download time in 
Shadow matches well with ns over varying packet loss 
rates. Although not presented here, we similarly vali- 


dated Shadow with our changes against a real network 
link using the bandwidth and packet loss rate that was 
achieved over our switch; the results did not significantly 
deviate from those presented in Figure 2a. 

For our second experiment, we check that the growth 
of the congestion window using CUBIC is accurate. 
We first transfer a 100 MiB hie over a 100 Mbit/s link 
between two physical Ubuntu 12.04 machines running 
the 3.2.0 Linux kernel. We record the cwnd (con- 
gestion window) and ssthresh (slow start threshold) 
values from the getsockopt function call using the 
TCP_INFO option. We then run an identical experiment 
in Shadow, setting the slow start threshold to what we 
observed from Linux and ensuring that packet loss hap- 
pens at roughly the same rate. Figure 2b shows the value 
of cwnd in both Shadow and Linux over time, and we 
see almost identical growth patterns. The slight varia- 
tion in the saw-tooth pattern is due to unpredictable vari- 
ation in the physical link that was not reproduced by 
Shadow. As a result. Shadow’s cwnd grew slightly faster 
than Linux’s because Shadow was able to send one ex- 
tra packet. We believe this is an artifact of our particular 
physical configuration and do not believe it significantly 
affects simulation accuracy in general: more importantly, 
the overall saw-tooth pattern matches well. 

The two experiments discussed above give us high 
confidence that our TCP implementation is accurate, 
both in responding to packet loss and in operation of the 
CUBIC congestion control algorithm. 

Shadow Topology Enhancements: To ensure that we 
are causing the most realistic performance and con- 
gestion effects possible during simulation, we enhance 
Shadow using techniques from recent research in mod- 
eling Tor topologies [39, 59], traceroute data from 
CAIDA [2], and client/server data from the Tor Metrics 
Portal [8] and Alexa [1]. This data-driven Internet map 
is more realistic than the one Shadow provides, and in- 
cludes 699,029 vertices and 1,338,590 edges. For space 
reasons, we provide more details in Appendix A. 2 [34]. 
Tor Model: Using Shadow with the improvements dis- 
cussed above, we build a Tor model that reflects the real 
Tor network as it existed in July 2013, using the then- 
latest stable Tor version 0.2.3.25. (We use this model 
for all experiments in this paper.) Using data from the 
Tor Metrics Portal [8], we configure a complete, private 
Tor network following Tor modeling best practices [33], 
and attach every node to the closest network location in 
our topology map. The resulting Tor network config- 
uration includes 10 directory authorities, 3,600 relays, 
13,800 clients, and 4,000 file servers — the largest known 
working private experimental Tor network, and the first 
to run at scale to the best of our knowledge. 

The 13,800 clients in our model provide background 
traffic and load on the network. 10,800 of our clients 
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Figure 2: Figure 2a compares Shadow to ns download times. Figure 2b compares congestion window over time when Shadow and 
Linux have the same link properties. Figure 2c compares Shadow-Tor to live Tor measurements collected from Tor Metrics [8], 


download 320 KiB files (the median size of a web page 
according to the most recent web statistics published by 
a Google engineer [53]) and then wait for a time cho- 
sen uniformly at random from the range [1, 60000] mil- 
liseconds after each completed download. 1,200 of our 
clients repeatedly download 5 MiB files with no pauses 
between completing a download and starting the next. 
The ratio of these client behaviors was chosen accord- 
ing to the latest known measurements of client traffic 
on Tor [18,43]. Shadow also contains 1,800 TorPerf [9] 
clients that download a file over a fresh circuit and pause 
for 60 seconds after each successful download. (TorPerf 
is a tool for measuring Tor performance.) 600 of the Tor- 
Perf clients download 50 KiB files, 600 download 1 MiB 
files, and 600 download 5 MiB files. Our simulations run 
for one virtual hour during each experiment. 

Figure 2c shows a comparison of publicly available 
TorPerf measurements collected on the live Tor net- 
work [8] to those collected in our private Shadow-Tor 
network. As shown in Figure 2c, our full size Shadow- 
Tor network is extremely accurate in terms of time to 
complete downloads for all file sizes. These results give 
us confidence that our at-scale Shadow-Tor network is 
strongly representative of the deployed Tor network. 

4 Congestion Analysis 

In this section, we explore where congestion happens in 
Tor through a large scale congestion analysis. We take a 
multifaceted approach by measuring congestion as it oc- 
curs in both the live, public Tor network, and in an exper- 
imental, private Tor network running in Shadow. By an- 
alyzing relays in the public Tor network, we get the most 
realistic and accurate view of what is happening at our 
measured relays. We supplement the data from a rela- 
tively small public relay sample with measurements from 
a much larger set of private relays, collecting a larger and 
more complete view of Tor congestion. 


To understand congestion, we are interested in mea- 
suring the time that data spends inside of Tor as well as 
inside of kernel sockets in both the incoming and outgo- 
ing directions. We will discuss our findings in both envi- 
ronments after describing the techniques that we used to 
collect the time spent in these locations. 

4.1 Congestion in the Live Tor Network 

Relays running in the operational network provide the 
most accurate source of congestion data, as these relays 
are serving real clients and transferring real traffic. As 
mentioned above, we are interested in measuring queu- 
ing times inside of the Tor application as well as inside 
of the kernel, and so we developed techniques for both in 
the local context of a public Tor relay. 

Tor Congestion: Measuring Tor queuing times requires 
some straightforward modifications to the Tor software. 
As soon as a relay reads the entire cell, it internally cre- 
ates a cell structure that holds the cell’s circuit ID, com- 
mand, and payload. We add a new unique cell ID value. 
Whenever a cell enters Tor and the cell structure is cre- 
ated, we log a message containing the current time and 
the cell’s unique ID. The cell is then switched to the out- 
going circuit. After it’s sent to the kernel we log another 
message containing the time and ID. The difference be- 
tween these times represents Tor application congestion. 
Kernel Congestion: Measuring kernel queuing times 
is much more complicated since Tor does not have di- 
rect access to the kernel internals. In order to log the 
times when a piece of data enters and leaves the ker- 
nel in both the incoming and outgoing directions, we 
developed a new, modular, application-agnostic, multi- 
threaded library, called libkqtime. 5 libkqtime 
uses libpcap [6] to determine when data crosses the 
host/network boundary, and function interposition on 

5 libkqtime was written in 770 LOC, and is available for down- 
load as open source software [5], 
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Figure 3: The distribution of congestion inside Tor and the kernel on our 3 relays running in the public Tor network, as measured 
between 2014-01-20 and 2014-01-28. Most congestion occurred in the outbound kernel queues on all three relays. 


the write ( ) , send ( ) , read ( ) , and recv ( ) func- 
tions to determine when it crosses the application/kernel 
boundary. The library copies a 16 byte tag as data enters 
the kernel from either end, and then searches for the tag 
as data leaves the kernel on the opposite end. This pro- 
cess works in both directions, and the timestamps col- 
lected by the library allow us to measure both inbound 
and outbound kernel congestion. Appendix B [34] gives 
a more detailed description of libkqtime. 

Results: To collect congestion information in Tor, we 
first ran three live relays (curiosity 1, curiosity2, and 
curiosity3) using an unmodified copy of Tor release 
0.2.3.25 for several months to allow them to stabi- 
lize. We configured them as non-exit nodes and used a 
network appliance to rate limit curiosityl at 1 Mbit/s, cu- 
riosity2 at 10 Mbit/s, and curiosity3 at 50 Mbit/s. Only 
curiosity2 had the guard flag (could be chosen as en- 
try relay for a circuit) during our data collection. On 
2014-01-20, we swapped the Tor binary with a version 
linked to libkqtime and modified as discussed in Sec- 
tion 4.1. We collected Tor and kernel congestion for 190 
hours (just under 8 days) ending on 2014-01-28, and then 
replaced the vanilla Tor binary. 

The distributions of congestion as measured on each 
relay during the collection period are shown in Figure 3 
with logarithmic x-axes. Our measurements indicate that 
most congestion, when present, occurs in the kernel out- 
bound queues, while kernel inbound and Tor congestion 
are both less than 1 millisecond for over 95 percent of our 
measurements. This finding is consistent across all three 
relays we measured. Kernel outbound congestion in- 
creases from curiosityl to curiosity2, and again slightly 
from curiosity2 to curiosity3, indicating that congestion 
is a function of relay capacity or load. We leave it to fu- 
ture work to analyze the strength of this correlation, as 
that is outside the scope of this paper. 

Ethical Considerations: We took careful protections to 
ensure that our live data collection did not breach users’ 
anonymity. In particular, we captured only buffered 


data timing information ; no network addresses were ever 
recorded. We discussed our experimental methodology 
with Tor Project maintainers, who raised no objections. 
Finally, we contacted the IRB of our relay host institu- 
tion. The IRB decided that no review was warranted 
since our measurements did not, in their opinion, con- 
stitute human subjects research. 

4.2 Congestion in a Shadow-Tor Network 

While congestion data from real live relays is the most 
accurate, it only gives us a limited view of congestion 
local to our relays. The congestion measured at our re- 
lays may or may not be representative of congestion at 
other relays in the network. Therefore, we use our pri- 
vate Shadow-Tor network to supplement our congestion 
data and enhance our analysis. Using Shadow provides 
many advantages over live Tor: it’s technically simpler; 
we are able to measure congestion at all relays in our pri- 
vate network; we can track the congestion of every cell 
across the entire circuit because we do not have privacy 
concerns with Shadow; and we can analyze how conges- 
tion changes with varying network configurations. 

Tor and Kernel Congestion: The process for collect- 
ing congestion in Shadow is simpler than in live Tor, 
since we have direct access to Shadow’s virtual kernel. 
In our modified Tor, each cell again contains a unique ID 
as in Section 4.1. However, when running in Shadow, 
we also add a 16 byte magic token and include both the 
unique ID and the magic token when sending cells out to 
the network. The unique ID is forwarded with the cell 
as it travels through the circuit. Since Shadow prevents 
Tor from encrypting cell contents for efficiency reasons, 
the Shadow kernel can search outgoing packets for the 
unencrypted magic token immediately before they leave 
the virtual network interface. When found, it logs the 
unique cell ID with a timestamp. It performs an anal- 
ogous procedure for incoming packets immediately af- 
ter they arrive on the virtual network interface. These 
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Figure 4: Relay congestion by circuit position in our Shadow-Tor network, measured on on circuits using end-to-end cells from 
1,200 clients selected uniformly at random. Most congestion occurred in the outbound kernel queues, independent of relay position. 


Shadow timestamps are combined with the timestamps 
logged when a cell enters and leaves Tor to compute both 
Tor and kernel congestion. 

Results: We use our model of Tor as described in Sec- 
tion 3, with the addition of the cell tracking information 
discussed above. Since tracking every cell would con- 
sume an extremely large amount of disk space, we sam- 
ple congestion as follows: we select 10 percent of the 
non-TorPerf clients (1,200 total) in our network chosen 
uniformly, and track 1 of every 100 cells traveling over 
circuits they initiate. The tracking timestamps from these 
cells are then used to attribute congestion to the relays 
through which the cells are traveling. 

It is important to understand that our method does not 
sample relay congestion uniformly: the congestion mea- 
surements will be biased towards relays that are cho- 
sen more often by clients, according to Tor’s bandwidth- 
weighted path selection algorithm. This means that our 
results will represent the congestion that a typical client 
will experience when using Tor. We believe that these 
results are more meaningful than those we could obtain 
by uniformly sampling congestion at each relay indepen- 
dently (as we did in Section 4.1), because ultimately we 
are interested in improving clients’ experience. 

The distributions of congestion measured in Shadow 
for each circuit position are shown in Figure 4. We again 
find that congestion occurs most significantly in the ker- 
nel outbound queues, regardless of a relay’s circuit po- 
sition. Our Shadow experiments indicate higher conges- 
tion than in live Tor, which we attribute to our client- 
oriented sampling method described above. 

5 Kernel-Informed Socket Transport 

Our large scale congestion analysis from Section 4 re- 
vealed that the most significant delay in Tor occurs in 
outbound kernel queues. In this section, we first explore 
how this problem adversely affects Tor’s traffic manage- 
ment by disrupting existing scheduling mechanisms to 


the extent that they become ineffective. We then describe 
the KIST algorithm and experimental results. 

5.1 Mismanaged Socket Output 

As described in Section 2, each Tor relay creates and 
maintains a single TCP connection to every relay to 
which it is connected. All communication between two 
relays occurs through this single TCP connection chan- 
nel. In particular, this channel multiplexes all circuits 
that are established between its relay endpoints. TCP 
provides Tor a reliable and in-order data transport. 
Sequential Socket Writes: Tor uses the asynchronous 
event library libevent [4] to assist with sending and 
receiving data to and from the kernel (i.e. network). Each 
TCP connection is represented as a socket in the kernel, 
and is identified by a unique socket descriptor. Tor reg- 
isters each socket descriptor with libevent, which itself 
manages kernel polling and triggers an asynchronous no- 
tification to Tor via a callback function of the readability 
and writability of that socket. When Tor receives this 
notification, it chooses to read or write as appropriate. 

An important aspect of these libevent notifications is 
that they happen for one socket at a time , regardless of 
the number of socket descriptors that Tor has registered. 
Tor attempts to send or receive data from that one socket 
without considering the state of any of the other sock- 
ets. This is particularly troublesome when writing, as 
Tor will only be able to choose from the non-empty cir- 
cuits belonging to the currently triggered socket and no 
other. Therefore, Tor’s circuit scheduler may schedule a 
circuit with worse priority than it would have if it could 
choose from all sockets that are able to be triggered at 
that time. Since the kernel schedules with a first-come 
first-serve (FCFS) discipline. Tor may actually be send- 
ing data out of priority order simply due to the order in 
which the socket notifications are delivered by libevent. 
Bloated Socket Buffers: Finux uses TCP auto- 

tuning to dynamically and monotonically increase each 
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Figure 5: Socket sharing affects circuit priority scheduling. 


socket buffer’s capacity using the socket connection’s 
bandwidth-delay product calculation [61]. TCP auto- 
tuning increases the amount of data the kernel will accept 
from the application in order to ensure that the socket is 
able to fully utilize the network link. TCP auto-tuning 
is an extremely useful technique to maximize throughput 
for applications with few sockets or without priority re- 
quirements. However, it may cause problems for more 
complex applications like Tor. 

When libevent notifies Tor that a socket is writable. 
Tor writes as much data as possible to that socket (i.e., 
until the kernel returns EWOULDBLOCK). Although this 
improves utilization when only a few auto-tuned sockets 
are in use, consider a Tor relay that writes to thousands of 
auto-tuned sockets (a common situation since Tor main- 
tains a socket for every relay with which it communi- 
cates). These sockets will each attempt to accept enough 
data to fully utilize the link. If Tor fills all of these sock- 
ets to capacity, the kernel will clearly be unable to imme- 
diately send it all to the network. Therefore, with many 
active sockets in general and for asymmetric connections 
in particular, the potential for kernel queuing delays are 
dramatic. As we have shown in Section 4, writing as 
much as possible to the kernel as Tor currently does re- 
sults in large kernel queuing delays. 

Tor can no longer adjust data priority once it is sent 
to the kernel, even if that data is still queued in the ker- 
nel when Tor receives data of higher importance later. 
To demonstrate how this may result in poor scheduling 
decisions, consider a relay with two circuits: one con- 
tains sustained, high throughput traffic of worse prior- 
ity (typical of many bulk data transfers), while the other 
contains bursty, low throughput traffic of better priority 
(typical of many interactive data transfer sessions). In 
the absence of data on the low throughput circuit, the 
high throughput circuit will fill the entire kernel socket 
buffer whether or not the kernel is able to immediately 
send that data. Then, when a better priority cell arrives, 
Tor will immediately schedule and write it to the kernel. 
However, since the kernel sends data to the network in 


the same order in which it was received from the appli- 
cation (FCFS), that better priority cell data must wait un- 
til all of the previously received high throughput data is 
flushed to the network. This problem theoretically wors- 
ens as the number of sockets increase, suggesting that 
recent research proposing that Tor use multiple sockets 
between each pair of relays [16,26] may be misguided. 
Effects on Circuit Priority: To study the effects on cir- 
cuit priority, we customized Tor as follows. First, we 
added the ability for each client to send a special cell af- 
ter building a circuit that communicates one of two prior- 
ity classes to the circuit’s relays: a better priority class; or 
a worse priority class. Second, we customized the built- 
in EWMA circuit scheduler that prioritizes bursty traffic 
over bulk traffic [57] to include a priority factor JF: the 
circuit scheduler counts & cells for every cell scheduled 
on a worse priority circuit. Therefore, the EWMA of 
the worse priority class will effectively increase & times 
faster than normal, giving a scheduling advantage to bet- 
ter priority traffic. 

We experiment with two separate private Tor net- 
works: one using Shadow [35], a discrete event network 
simulator that runs Tor in virtual processes; and the other 
using DETER [3], a custom experimentation testbed that 
runs Tor on bare-metal hardware. We consider two 
clients downloading from two file servers through Tor in 
the scenarios shown in Figure 5a: 

- shared socket: the clients share entry and middle re- 
lays, but use different exit relays - the clients’ circuits 
each belong to the same socket connecting the middle 
to the entry; and 

- unshared sockets: the clients share only the middle 
relay - the clients’ circuits each belong to indepen- 
dent sockets connecting the middle to each entry. 

We assigned one client’s traffic to the better priority 
class (denoted with “+”) and the other client’s traffic to 
the worse priority class (denoted with “-”). We config- 
ured all nodes with a 10 Mbit symmetric access link, 
and approximated a middle relay bottleneck by setting its 
socket buffer size to 32 KiB. Our configuration allows us 
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to focus on the socket contention that will occur at the 
middle relay, and the four cases that result when con- 
sidering whether or not the two circuits share incoming 
or outgoing TCP connections at the middle relay. Clients 
downloaded data through the circuits continuously for 10 
minutes in Shadow and 60 minutes on DETER. 6 

The results collected during each of the scenarios are 
shown in Figure 5. Plotted is the cumulative distribu- 
tion of the throughput achieved by the better (“pri+”) and 
worse (“pri-”) priority clients using the priority sched- 
uler, as well as the combined cumulative distribution for 
both clients using Tor’s default round-robin scheduler 
(“rr”). As shown in Figure 5b, performance differenti- 
ation occurs correctly with the priority scheduler on a 
shared socket. However, as shown in Figure 5c, the pri- 
ority scheduler is unable to differentiate throughput when 
the circuits do not share a socket. 

Discussion: As outlined above, the reason for no differ- 
entiation in the case of the unshared socket is that both 
circuits are treated independently by the scheduler due to 
the sequential libevent notifications and the fact that Tor 
currently schedules circuits belonging to one socket at a 
time while ignoring the others. We used TorPS [10], a 
Tor path selection simulator, to determine how often we 
would expect unshared sockets to occur in practice. We 
used TorPS to build 10 million paths following Tor’s path 
selection algorithm, and computed the probability of two 
circuit paths belonging to each scenario. We found that 
any two paths may be classified as unshared (they share 
at least one relay but never share an outgoing socket) at 
least 99.775 percent of the time, clearly indicating that 
adjusting Tor’s socket management may have a dramatic 
effect on data priority inside of Tor. 

Note that the socket mismanagement problem is not 
solved simply by parallelizing the libevent notification 
system and priority scheduling processes (which would 
require complex code), or by utilizing classful queuing 
disciplines in the kernel (which would require root priv- 
ileges); while these may improve control over traffic pri- 
ority to some extent, they would still result in bloated 
buffers containing data that cannot be sent due to closed 
TCP congestion windows. 

5.2 The KIST Algorithm 

In order to overcome the inefficiencies resulting from 
Tor’s socket management, KIST chooses between all cir- 
cuits that have queued data irrespective of the socket to 
which the circuit belongs, and dynamically adjusts the 
amount written to each socket based on real-time kernel 
information. We now detail each of these approaches. 

6 The small-scale experiments described here are meant to isolate 
Tor’s internal queuing behavior for analysis purposes, and do not fully 
represent the live Tor network, its background traffic, or its load. 


Algorithm 1 The KIST NotifySocketWritable ( ) 

callback, invoked by libevent for each writable socket. 
Require: sdesc,conn, ST -t— GlobalW riteTimeout 
1: L p <r- getPendingConnectionListQ 
2: if L p is Null then 
3: Lp newList () 

4: setPendingConnectionList{L p ) 

5: creafeCa///7flck(^’,NotifyGlobalWrite ( ) ) 

6 : end if 

7: if L p .contains(conn) is False then 
8: L p .add(conn ) 

9: end if 

10: disableNotify(sdesc) 


Global Circuit Scheduling: Recall that libevent delivers 
write notification events for a single socket at a time. Our 
approach with KIST is relatively straightforward: rather 
than handle the kernel write task immediately when 
libevent notifies Tor that a socket is writable, we simply 
collect a set of sockets that are writable over a time inter- 
val specified by an adjustable GlobalWriteTimeout 
parameter. This allows us to increase the number of can- 
didate circuits we consider when scheduling and writ- 
ing cells to the kernel: we may select among all circuits 
which contain cells that are waiting to be written to one 
of the sockets in our writable set. 

The socket collection approach is outlined in Algo- 
rithm 1. The socket descriptor sdesc and a connection 
state object conn are supplied by libevent. Note that 
we disable notification events for the socket (as shown in 
line 10) in order to prevent duplicate notification events 
during the socket collection interval. 

After the GlobalWriteTimeout time interval, 
KIST begins writing cells to the sockets according to the 
circuit scheduling policy. There are two major phases 
to this process, which is outlined in Algorithm 2. In 
lines 4 and 8, we distinguish sockets that contain raw 
bytes ready to be written directly to the kernel (previ- 
ously scheduled cells with TFS headers attached) from 
those with additional cells ready to be converted to raw 
bytes. KIST first writes the already scheduled raw bytes 
(lines 4-7), and then schedules and writes additional cells 
after converting them to raw bytes and adding TFS head- 
ers (lines 13-15). Note that the connections should be 
enumerated (on line 3 of Algorithm 2) in an order that 
respects the order in which cells were converted to raw 
bytes by the circuit scheduler in the previous round. 

The global scheduling approach does not by itself 
solve the bloated socket buffer problem. KIST also dy- 
namically computes socket write limits on line 2 of Al- 
gorithm 2 using real-time TCP, socket, and bandwidth in- 
formation, which it then uses when deciding how much 
to write to the kernel. 
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Algorithm 2 The KIST NotifyGlobalWrite ( ) call- 
back, invoked after the GlobalWriteTimeout period. 

1: League 4 newListQ 

2: K 4— collectKernelInfo(getConnectionList ( )) 

3: for all conn in get PendingConnectionList ( ) do 
4: if hasBytesForKernel(conn) is True then 

5: enableNotify(conn) 

6: nBytes 4— writeBytesToKernel (K . conn) 

7: end if 

B: if hasCells(conn ) is True and 

get Limit (K, conn) > 0 then 
9: L e i igib i e . add (conn) 

10: end if 

1 1 : end for 

12: while LgHgihie . is Em pty ( ) is False do 

13: conn <— scheduleCell(L e u g i b i e ) {cell to bytes} 

14: enableN otify(conn) 

15: nBytes 4— writeBytesToKernel(K,conn) 

16: if nBytes is 0 or get Limit (K, conn) is 0 then 

1 7 : Leiigibie ■ remove ( conn ) 

18: end if 

19: end while 


Managing Socket Output: KIST attempts to move the 
queuing delays from the kernel outbound queue to Tor’s 
circuit queue by keeping kernel output buffers as small 
as possible, i.e., by only writing to the kernel as much 
as the kernel will actually send. By delaying the circuit 
scheduling decision until the last possible instant before 
kernel starvation occurs. Tor will ultimately improve its 
control over the priority of outgoing data. This approach 
attempts to give Tor approximately the same control over 
outbound data that it would have if it had direct access to 
the network interface. When combined with global cir- 
cuit scheduling. Tor’s influence over outgoing data prior- 
ity should improve. 

To compute write limits, KIST first makes three sys- 
tem calls for each connection: getsockopt on level 
SOL_SOCKET for option SOJSNDBUF to get sndbufcap, 
the capacity of the send buffer; ioctl with command 
SIOCOUTQ to get sndbuflen, the current length of the 
send buffer; and getsockopt on level SOL_TCP for 
option TCP_INFO to get tcpi, a variety of TCP state in- 
formation. The TCP information used by KIST includes 
the connection’s maximum segment size mss, the con- 
gestion window cwnd, and the number of unacked pack- 
ets for which the kernel is waiting for an acknowledg- 
ment from the TCP peer. KIST then computes a write 
limit for each connection c as follows: 

socket space c = sndbufcap c — sndbuflen c 
tcpspace c = (cwnd c — unacked c ) ■ mss c (1) 
limit c = min (socket space c , tcpspacef) 


The key insight in Equation 1 is that TCP will not al- 
low the kernel to send more packets than dictated by the 
congestion window, and that the unacknowledged pack- 
ets prevent the congestion window from sliding open. By 
respecting this write limit for each connection, KIST en- 
sures that the data sent to the kernel is immediately send- 
able and reduces kernel queuing delays. 

If all connections are sending data in parallel, it is still 
possible to overwhelm the kernel with more data than 
it can physically send to the network. Therefore, KIST 
also computes a global write limit at the beginning of 
each GlobalWriteTimeout period: 

sndbuflen prev = sndbuflen 
sndbuflen = £ e . (sndbuflen Cj ) 
bytes sent = sndbuflen — sndbuflen prev 
limit = max(limit, bytes sent) 

Note that Equation 2 is an attempt to measure the actual 
upstream bandwidth speed of the machine. In practice, 
this could be done in a testing phase during which writes 
are not limited, configured manually, or estimated using 
other techniques such as packet trains [32], 

The connection and global limits are computed at the 
beginning of a scheduling round, i.e., on line 2 of Algo- 
rithm 2; they are enforced whenever bytes are written to 
the kernel, i.e., on lines 6 and 15 of Algorithm 2. Note 
that they will be bounded above by Tor’s independently 
configured connection and global application rate limits. 

5.3 Experiments and Results 

We use Shadow and its models as discussed in Section 3 
to measure KIST’s effect on network performance, con- 
gestion, and throughput. We also evaluate its CPU over- 
head. See Appendix C [34] for an analysis under a more 
heavily loaded Shadow-Tor network. Note that we found 
that KIST performs as well or better under heavier load 
than under normal load as presented in this section, indi- 
cating that it can gracefully scale as Tor grows. 
Prototype: We implemented a KIST protoype as a patch 
to Tor version 0.2.3.25, and included the elements 
discussed in Section 4 necessary for measuring conges- 
tion during our experiments. We tested vanilla Tor us- 
ing the default CircuitPriorityHalf lif e of 30, 
the global scheduling part of KIST (without enforcing 
the write limits), and the complete KIST algorithm. We 
configured the global scheduler to use a 10 millisecond 
GlobalWriteTimeout in both the global and KIST 
experiments. Note that our KIST implementation ignores 
the connection enumeration order on line 3 of Algo- 
rithm 2, an optimization that may further improve Tor’s 
control over priority in cases where the global limit is 
reached before the algorithm reaches line 12. 
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(a) Kernel Out Congestion (b) Tor Congestion (c) Circuit Congestion 

Figure 6: Congestion for vanilla Tor, KIST, and the global scheduling part of KIST (without enforcing write limits). Figures 6a 
and 6b show the distribution of cell congestion local to each relay (with logarithmic x-axes), while Figure 6c shows the distribution 
of the end-to-end circuit congestion for all measured cells. 



Figure 7: Client performance for vanilla Tor, KIST, and the global scheduling part of KIST (without enforcing write limits). 
Figure 7a shows the distribution of the time until the client receives the first byte of the data payload, for all clients, while the inset 
graph shows the same distribution with a logarithmic x-axis. Figures 7b and 7c show the distribution of time to complete a 320 KiB 
and 5 MiB Hie by the “web” and “bulk” clients, respectively. 


Congestion: Recall that the goal of KIST is to move con- 
gestion from the kernel outbound queue to Tor where it 
can better be managed. Figure 6 shows KIST’s effective- 
ness in this regard. In particular. Figure 6a shows that 
KIST reduces kernel outbound congestion over vanilla 
Tor by one to two order of magnitude for over 40 percent 
of the sampled cells. Further, it shows that the queue 
time is less than 200 milliseconds for 99 percent of the 
cells measured, compared to over 4000 milliseconds for 
both vanilla Tor and global scheduling alone. 

Figure 6b shows how global scheduling and KIST in- 
crease the congestion inside of Tor. Both global schedul- 
ing and KIST result in sharp Tor queue time increases 
up to 10 milliseconds, after which the existing 10 mil- 
lisecond GlobalWriteTimeout timer event will fire 
and Tor will flush more data to the kernel. With global 
scheduling, most of the data queued in Tor quickly gets 
transferred to the kernel following this timeout, whereas 
data is queued inside of Tor much longer when using 
KIST. This result is an explicit feature of KIST, as it 
means Tor will have more control over data priority when 
scheduling circuits. 


While we have shown above how KIST is able to move 
congestion from the kernel into Tor, Figure 6c shows the 
aggregate effect on cell congestion during its complete 
existence through the entire end-to-end circuit. KIST re- 
duces aggregate circuit congestion from 1010.1 millisec- 
onds to 704.5 milliseconds in the median, a 30.3 percent 
improvement, while global scheduling reduces conges- 
tion by 13 percent to 878.8 milliseconds. 

The results in Figure 6 show that KIST indeed 
achieves its congestion management goals while high- 
lighting the importance of limiting kernel write amounts 
in addition to globally scheduling circuits. 
Performance: We show in Figure 7 how KIST af- 
fects client performance. Figure 7a shows how net- 
work latency is generally affected by showing the time 
until the first byte of every download by all clients. 
Global scheduling alone is roughly indistinguishable 
from vanilla Tor, while KIST reduces latency to the first 
byte for over 80 percent of the downloads — in the me- 
dian, KIST reduces network latency by 18. 1 percent from 
0.838 seconds to 0.686 seconds. The inset graph has a 
logarithmic x-axis and shows that KIST is particularly 
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Figure 8: Aggregate relay write throughput for vanilla Tor, 
KIST, and the global scheduling part of KIST (without enforc- 
ing write limits). Both of our enhancements increase network 
throughput over vanilla Tor. 

beneficial in the upper parts of the distribution: in the 
99th percentile, latency is reduced from more than 7 sec- 
onds to less than 2.7 seconds. 

Figures 7b and 7c show the distribution of time to 
complete each 320 KiB download for the “web” clients 
and each 5 MiB file for the “bulk” clients, respectively. 
In our experiments, the 320 KiB download times de- 
creased by over 1 second for over 40 percent of the down- 
loads, while the download times for 5 MiB files increased 
by less than 8 seconds for all downloads. These changes 
in download times are a result of Tor correctly utiliz- 
ing its circuit priority scheduler, which prioritizes traffic 
with the lowest exponentially-weighted moving average 
throughput. As the “web” clients pause between down- 
loads, their traffic is often prioritized ahead of “bulk” 
traffic. Our results indicate that not only does KIST de- 
crease Tor network latency, it also increases Tor’s ability 
to appropriately manage its traffic. 

Throughput: We show in Figure 8 KIST’s effect on 
relay throughput. Shown is the distribution of aggre- 
gate bytes written per second by all relays in the net- 
work. We found that throughput improves when using 
KIST due to a combination of the reduction in network 
latency and our client model: web clients completed their 
downloads faster in the lower latency network and there- 
fore also downloaded more files. By lowering circuit 
congestion, KIST improves utilization of existing band- 
width resources over vanilla Tor by 7 1 .6 MiB/s, or 9.8%, 
in the median. While the best network utilization is 
achieved with global scheduling without write limits (a 
150.1 MiB/s, or 20.5%, improvement over vanilla Tor in 
the median), we have shown above that it is less effective 
than KIST at reducing kernel congestion and allowing 
Tor to correctly prioritize traffic. 

Overhead: The main overhead in KIST involves the 
collection of socket and TCP information from the ker- 
nel using three separate calls to getsockopt (socket 
capacity, socket length, and TCP info). These three 
system calls are made for every connection after ev- 



Figure 9: Network size correlates with performance. 

ery GlobalWriteTimeout interval. To understand 
the overhead involved with these calls, we instrumented 
curiosity3 from Section 4 to collect timing information 
while performing the syscalls required by KIST. Our test 
ran on the live relay running an Intel Xeon x3450 CPU 
at 2.67GHz for 3 days and 14 hours, and collected a tim- 
ing sample every second for a total of 309,739 samples. 
We found that the three system calls took 0.9140 mi- 
croseconds per connection in the median, with a mean 
of 0.9204 and a standard deviation of 3.1 x 10 5 . 

The number of connections a relay may have is 
bounded above roughly by the number of relays in the 
network, which is currently around 5,000. Therefore, we 
expect the overhead to be less than 5 milliseconds and 
reasonable for current relays. If this overhead becomes 
problematic as Tor grows, the gathering of kernel infor- 
mation can be outsourced to a helper thread and continu- 
ously updated over time. Further, we have determined 
through discussions with Linux kernel developers that 
the netlink socket diag interface could be used to collect 
information for several sockets at once — an optimization 
that may provide significant reductions in overhead. 

6 Security Analysis 

Performance and Security: Performance and ease of 
use affect adoption rates of any network technology. 
They have played a central role in the size and diversity 
of the Tor userbase. This can then affect the size of the 
network itself as users are more willing to run parts of the 
network or contribute financially to its upkeep, e.g., via 
torservers.net. Growth from performance improvements 
affect the security of Tor by increasing the uncertainty for 
many types of adversaries concerning who is communi- 
cating with whom [17,20,22,37]. Performance factors 
in anonymous communication systems like Tor are thus 
pertinent to security in a much more direct way than they 
typically would be for, say, a faster signature algorithm’s 
impact on the security of an authentication system. 

Though real and more significant, direct effects of per- 
formance on Tor’s security from network and userbase 
growth are also hard to show, given both the variety of 
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(b) Cumulative Pings until Best Estimate 


Figure 10: Latency leaks are more pronounced (10a) and are faster (10b) with KIST. 


causal factors and the difficulty of gathering useful data 
while preserving privacy. Whatever the causal explana- 
tion, a correlation (—.62) between Tor performance im- 
provement over time (measured by the median download 
time of a 1 MB file) and network size is shown in Fig- 
ure 9. (Numbers are from the Tor Metrics Portal [8].) 
Similar results hold when number of relays is replaced 
with bandwidth metrics. Which is more relevant depends 
on the adversary’s most significant constraints: adver- 
sary size and distribution across the underlying network 
are important considerations [39]. 

More measurable effects can occur if a performance 
change creates a new opportunity for attack or makes an 
existing attack more effective or easier to mount. Perfor- 
mance change may also eliminate or diminish previous 
possible or actual attacks. Growth effects are potentially 
the greatest security effects of our performance changes, 
but we now focus on these more directly observable as- 
pects. They include attacks on Tor based on resource 
contention or interference [19, 24, 25, 31, 44, 46, 48] or 
simply available resource observation [44], or observ- 
ing other performance properties, such as latency [31], 
Many papers have also explored improvements to Tor 
performance via scheduling, throttling, congestion man- 
agement, etc. (see Section 2). Manipulating performance 
enhancement mechanisms can turn them into potential 
vectors of attack themselves [38], Geddes et al. [25] 
analyzed anonymity impact of several performance en- 
hancement mechanisms for Tor. 

Latency Leak: The basic idea of a latency leak attack 
as first set out in [30] is to measure RTT (roundtrip time) 
between a compromised exit and the client of some tar- 
get connection repeatedly and then to pick the shortest 
result as an indication of latency. Next compare this to 
the known, measured latency through all the hops in the 
circuit except client to entry relay. (Other attacks such as 
throughput measurement, discussed below, are assumed 
to have already identified the relays in the circuit.) Next, 


use that to determine the latency between the client of 
the target connection and its entry relay, which is in a 
known network location. This can significantly reduce 
the range of possible network locations for the client. 
When measuring latency using our improved models and 
simulator, we discovered that this attack is generally able 
to determine latency well with vanilla Tor. While KIST 
improves the overall accuracy, the improvement is small 
when a good estimate was also found with vanilla Tor. 

Figure 10a shows the results of an experiment run on 
our model from Section 3 with random circuit and client 
choices, indicating the difference between the correct la- 
tency and the estimate after a few hundred pings once per 
second. Roughly 20% of circuits for both vanilla Tor and 
KIST are within 25ms of the correct latency. After this 
they diverge, but both have a median latency estimate 
of about 50ms or less. It is only for the worst 10-20% 
of estimates, which are presumably not useful anyway, 
that KIST is substantially better. While the eventual ac- 
curacy of the attack is comparable for both, the attacker 
under KIST is significantly faster on average. Figure 10b 
shows the cumulative number of pings (seconds) until a 
best estimate is achieved. After 200 pings, nearly 40% 
of KIST circuits have achieved their best estimate while 
less than 10% have for vanilla Tor. And the median num- 
ber of pings needed for KIST is about 700 vs. 1200 for 
vanilla Tor. 

The accuracy of the latency attack indicated above is a 
significant threat to network location, which from a tech- 
nical perspective is what Tor is primarily designed to pro- 
tect. It could be diminished by padding latency. Specif- 
ically any connection at the edges of the Tor network, 
at either source or destination end, could be dynamically 
padded by the entry or exit relay respectively to ideally 
make latency of all edge connections through that re- 
lay uniform — more realistically to significantly decrease 
the network location information leaked by latency. Re- 
lays can do their own RTT measurements for any edge 
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(a) Throughput Correlation, Probe to Target Client (b) Guards Remaining in Candidate Guard Set 


Figure 11: While the aggregate throughput correlations of the probe to the true guard over the set of all probes (11a) are not 
significantly affected by KIST, it is slightly easier for the adversary to eliminate candidate guards of a target client (for all clients) 
(1 lb) when using KIST. 


connection and pad accordingly. There are many issues 
around this suggestion, which we leave to future work. 
Throughput Leak: The throughput attack introduced by 
Mittal et al. [44] identifies the entry relay of a target con- 
nection by setting up one-hop probe circuits from attack 
clients through all prospective entry relays and back to 
the client in order to measure throughput at those relays. 
These throughputs are compared to the throughput ob- 
served by the exit relay of the target circuit. The attack is 
directed against circuits used for bulk downloading since 
these will attempt a sustained maximum throughput, and 
will result in congestion effects on bottleneck relays that 
allow the adversary to reduce uncertainty about possible 
entry relays. Mittal et al. also looked at attacks on lower 
bandwidth interactive traffic and found some success, al- 
though with much less accuracy than for bulk traffic. 

We analyze the extent to which KIST affects the 
throughput attack. While measuring throughput at en- 
try relays, we also adopt the simplification of Geddes et 
al. [25] of restricting observations to entry relays that are 
not used as middle or exit relays for other bulk download- 
ing circuits. This allows us the efficiency of making mea- 
surements for several simulated attacks simultaneously 
while minimizing interference between their probes. 

Figure 11a shows the cumulative distribution of scores 
for correlation of probe throughput at the correct entry 
relay with throughput at the observed exit relay under 
vanilla Tor and under KIST scheduling (on the network 
and user model given in Section 3). Throughput was 
measured every 100 ms. We found that the throughput 
correlations are not significantly affected by KIST. 

To explain the correlation scores, recall from Sec- 
tion 5.2 how KIST reduces both circuit congestion and 
network latency by allowing Tor to properly prioritize 
circuits independent of the TCP connections to which 
they belong. This leads to two competing potential ef- 


fects on the throughput attack: (1) a less congested net- 
work will increase the sensitivity of the probes to vari- 
ations in throughput, thereby allowing stronger corre- 
lations between the throughput achieved by the probe 
client and that achieved by the target client; and (2) a cir- 
cuit’s throughput is most correlated with that of its bottle- 
neck relay, and KIST’s improved scheduling should also 
reduce the bottleneck effects of congestion in the net- 
work and allow weaker throughput correlations. Further, 
the improved priority scheduling (moving from round- 
robin over TCP connections to properly utilizing EWMA 
over circuits) will cause the throughput of each client to 
become slightly “burstier” over the short term as the pri- 
ority causes the scheduler to oscillate between the cir- 
cuits. We suspect that the similar correlation scores are 
the result of combining these effects. 

To further understand KIST’s affect on the throughput 
attack, we measure how the correlation of every client’s 
throughput to the true guard’s throughput compares to 
the correlation of the client’s throughput to that of every 
other candidate guard in the network. For every client, 
we start with a candidate guard set of all guards, and re- 
move those guards with a lower correlation score with 
the client than the true guard’s score. Figure 1 lb shows 
the distribution, over all clients, of the extent to which 
we were able to reduce the size of the candidate guard 
set using this heuristic. Although KIST reduced the un- 
certainty about the true guard used by the target client, 
we do not expect the small improvement to significantly 
affect the ability to conduct a successful throughput at- 
tack in practice. 

7 Conclusion 

In this paper, we outlined the results of an in-depth 
congestion study using both public and private Tor net- 
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works. We identified that most congestion occurs in 
outbound kernel buffers, analyzed Tor socket manage- 
ment, and designed a new socket transport mechanism 
called KIST. Through evaluation in a full-scale private 
Shadow-Tor network, we conclude that KIST is capa- 
ble of moving congestion into Tor where it can be bet- 
ter managed by application priority scheduling mecha- 
nisms. More specifically, we found that by considering 
all sockets and respecting TCP state information when 
writing data to the kernel, KIST reduces both conges- 
tion and latency while increasing utilization. Finally, we 
performed a detailed evaluation of KIST against well- 
known latency and throughput attacks. While KIST in- 
creases the speed at which true network latency can be 
calculated, it does not significantly affect the accuracy of 
the probes required to correlate throughput. 

Future work should extend our simulation-based eval- 
uation and consider how KIST performs for relays in 
the live Tor network. We note that our analysis is 
based exclusively on Linux relays, as 91% of Tor’s band- 
width is provided by relays running a Linux-based dis- 
tribution [58], Although we expect KIST to improve 
performance similarly across platforms because it pri- 
marily works by managing socket buffer levels, future 
work should consider how KIST is affected by the inter- 
operation of relays running on a diverse set of OSes. Fi- 
nally, our KIST prototype would benefit from optimiza- 
tions, particularly by running the process of gathering 
kernel state information in a separate thread and/or us- 
ing the netlink socket diag interface. 
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